1.
IEEE Trans Vis Comput Graph ; 30(2): 1564-1578, 2024 Feb.
Article in English | MEDLINE | ID: mdl-37159326

ABSTRACT

Large tree structures are ubiquitous and real-world relational datasets often have information associated with nodes (e.g., labels or other attributes) and edges (e.g., weights or distances) that need to be communicated to the viewers. Yet, scalable, easy to read tree layouts are difficult to achieve. We consider tree layouts to be readable if they meet some basic requirements: node labels should not overlap, edges should not cross, edge lengths should be preserved, and the output should be compact. There are many algorithms for drawing trees, although very few take node labels or edge lengths into account, and none optimizes all requirements above. With this in mind, we propose a new scalable method for readable tree layouts. The algorithm guarantees that the layout has no edge crossings and no label overlaps, and optimizes one of the remaining aspects: desired edge lengths and compactness. We evaluate the performance of the new algorithm by comparison with related earlier approaches using several real-world datasets, ranging from a few thousand nodes to hundreds of thousands of nodes. Tree layout algorithms can be used to visualize large general graphs, by extracting a hierarchy of progressively larger trees. We illustrate this functionality by presenting several map-like visualizations generated by the new tree layout algorithm.
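
The readability requirements listed in this abstract (no label overlaps, no edge crossings, preserved edge lengths) can be checked on any finished layout. The sketch below is only an illustration of those three criteria, not the authors' algorithm or their evaluation metrics; the function names, the rectangle format `(x_min, y_min, x_max, y_max)`, and the relative-error measure are assumptions made for the example. It compares all pairs, so it is quadratic and suited to small layouts only.

```python
import math
from itertools import combinations


def boxes_overlap(a, b):
    """True if two axis-aligned label rectangles share interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def _orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p)."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def segments_cross(p1, p2, p3, p4):
    """True if segment p1-p2 properly crosses segment p3-p4."""
    d1 = _orient(p3, p4, p1)
    d2 = _orient(p3, p4, p2)
    d3 = _orient(p1, p2, p3)
    d4 = _orient(p1, p2, p4)
    return d1 * d2 < 0 and d3 * d4 < 0


def readability_report(positions, label_boxes, edges, desired_lengths):
    """Count label overlaps and edge crossings; average the edge-length error."""
    overlaps = sum(
        boxes_overlap(label_boxes[u], label_boxes[v])
        for u, v in combinations(label_boxes, 2)
    )
    crossings = sum(
        segments_cross(positions[a], positions[b], positions[c], positions[d])
        for (a, b), (c, d) in combinations(edges, 2)
        if len({a, b, c, d}) == 4  # edges sharing a node cannot properly cross
    )
    # Mean relative deviation from the desired edge length (assumed measure).
    length_error = sum(
        abs(math.dist(positions[u], positions[v]) - desired_lengths[(u, v)])
        / desired_lengths[(u, v)]
        for u, v in edges
    ) / len(edges)
    return {
        "label_overlaps": overlaps,
        "edge_crossings": crossings,
        "mean_length_error": length_error,
    }
```

A layout that meets the two guarantees stated in the abstract would report zero for `label_overlaps` and `edge_crossings`, leaving `mean_length_error` as the quantity to optimize.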

2.
Nature ; 622(7983): 594-602, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37821698

ABSTRACT

Metagenomes encode an enormous diversity of proteins, reflecting a multiplicity of functions and activities. Exploration of this vast sequence space has been limited to a comparative analysis against reference microbial genomes and protein families derived from those genomes. Here, to examine the scale of yet untapped functional diversity beyond what is currently possible through the lens of reference genomes, we develop a computational approach to generate reference-free protein families from the sequence space in metagenomes. We analyse 26,931 metagenomes and identify 1.17 billion protein sequences longer than 35 amino acids with no similarity to any sequences from 102,491 reference genomes or the Pfam database. Using massively parallel graph-based clustering, we group these proteins into 106,198 novel sequence clusters with more than 100 members, doubling the number of protein families obtained from the reference genomes clustered using the same approach. We annotate these families on the basis of their taxonomic, habitat, geographical and gene neighbourhood distributions and, where sufficient sequence diversity is available, predict protein three-dimensional models, revealing novel structures. Overall, our results uncover an enormously diverse functional space, highlighting the importance of further exploring the microbial functional dark matter.


Subjects
Metagenome , Metagenomics , Microbiology , Proteins , Cluster Analysis , Metagenome/genetics , Metagenomics/methods , Proteins/chemistry , Proteins/classification , Proteins/genetics , Protein Databases , Protein Conformation
3.
ACM BCB ; 2022, 2022 Aug.
Article in English | MEDLINE | ID: mdl-35960866

ABSTRACT

Clinical EHR data are naturally heterogeneous and contain abundant sub-phenotypes. Such diversity creates challenges for outcome prediction using a machine learning model since it leads to high intra-class variance. To address this issue, we propose a supervised pre-training model with a unique embedded k-nearest-neighbor positive sampling strategy. We demonstrate the enhanced performance value of this framework theoretically and show that it yields highly competitive experimental results in predicting patient mortality in real-world COVID-19 EHR data with a total of over 7,000 patients admitted to a large, urban health system. Our method achieves an AUROC prediction score of 0.872, which outperforms the alternative pre-training models and traditional machine learning methods. Additionally, our method performs much better when the training data size is small (345 training instances).

4.
Patterns (N Y) ; 2(12): 100389, 2021 Dec 10.
Article in English | MEDLINE | ID: mdl-34723227

ABSTRACT

Deep learning (DL) models typically require large-scale, balanced training data to be robust, generalizable, and effective in the context of healthcare. This has been a major issue for developing DL models for the coronavirus disease 2019 (COVID-19) pandemic, where data are highly class imbalanced. Conventional approaches in DL use cross-entropy loss (CEL), which often suffers from poor margin classification. We show that contrastive loss (CL) improves the performance of CEL, especially in imbalanced electronic health records (EHR) data for COVID-19 analyses. We use a diverse EHR dataset to predict three outcomes: mortality, intubation, and intensive care unit (ICU) transfer in hospitalized COVID-19 patients over multiple time windows. To compare the performance of CEL and CL, models are tested on the full dataset and a restricted dataset. CL models consistently outperform CEL models, with differences ranging from 0.04 to 0.15 for area under the precision and recall curve (AUPRC) and 0.05 to 0.1 for area under the receiver-operating characteristic curve (AUROC).

5.
Sci Rep ; 11(1): 15285, 2021 Jul 27.
Article in English | MEDLINE | ID: mdl-34315936

ABSTRACT

This study examined how people choose their path to a target, and the visual information they use for path planning. Participants avoided stepping outside an avoidance margin between a stationary obstacle and the edge of a walkway as they walked to a bookcase and picked up a target from different locations on a shelf. We provided an integrated explanation for path selection by combining avoidance margin, deviation angle, and distance to the obstacle. We found that the combination of right and left avoidance margins accounted for 26%, deviation angle accounted for 39%, and distance to the obstacle accounted for 35% of the variability in decisions about the direction taken to circumvent an obstacle on the way to a target. Gaze analysis findings showed that participants directed their gaze to minimize the uncertainty involved in successful task performance and that gaze sequence changed with obstacle location. In some cases, participants chose to circumvent the obstacle on a side for which the gaze time was shorter, and the path was longer than for the opposite side. Our results of a path selection judgment test showed that the threshold for participants abandoning their preferred side for circumventing the obstacle was a target location of 15 cm to the left of the bookcase shelf center.

6.
Exp Appl Acarol ; 84(3): 607-622, 2021 Jul.
Article in English | MEDLINE | ID: mdl-34148204

ABSTRACT

Smartphone cameras and digital devices are increasingly used in the capture of tick images by the public as citizen scientists, and rapid advances in deep learning and computer vision have enabled brand new image recognition models to be trained. However, there is currently no web-based or mobile application that supports automated classification of tick images. The purpose of this study was to compare the accuracy of a deep learning model pre-trained with millions of annotated images in ImageNet, against a shallow custom-built convolutional neural network (CNN) model for the classification of common hard ticks present in anthropic areas from northeastern USA. We created a dataset of approximately 2000 images of four tick species (Ixodes scapularis, Dermacentor variabilis, Amblyomma americanum and Haemaphysalis sp.), two sexes (male, female) and two life stages (adult, nymph). We used these tick images to train two separate CNN models: ResNet-50 and a simple shallow custom-built model. We evaluated our models' performance on an independent subset of tick images not seen during training. Compared to the ResNet-50 model, the small shallow custom-built model had higher training (99.7%) and validation (99.1%) accuracies. When tested with new tick image data, the shallow custom-built model yielded higher mean prediction accuracy (80%), greater confidence of true detection (88.7%) and lower mean response time (3.64 s). These results demonstrate that, with limited data size for model training, a simple shallow custom-built CNN model has great prospects for use in the classification of common hard ticks present in anthropic areas from northeastern USA.


Subjects
Ixodes , Ixodidae , Amblyomma , Animals , Female , Male , Computer Neural Networks , Nymph
7.
IEEE Trans Big Data ; 7(1): 38-44, 2021 Mar.
Article in English | MEDLINE | ID: mdl-33768136

ABSTRACT

Traditional Machine Learning (ML) models have had limited success in predicting coronavirus disease 2019 (COVID-19) outcomes using Electronic Health Record (EHR) data, partially due to not effectively capturing the inter-connectivity patterns between various data modalities. In this work, we propose a novel framework that utilizes relational learning based on a heterogeneous graph model (HGM) for predicting mortality at different time windows in COVID-19 patients within the intensive care unit (ICU). We utilize the EHRs of one of the largest and most diverse patient populations across five hospitals in a major health system in New York City. In our model, we use an LSTM for processing time-varying patient data and apply our proposed relational learning strategy in the final output layer along with other static features. Here, we replace the traditional softmax layer with a Skip-Gram relational learning strategy to compare the similarity between a patient and outcome embedding representation. We demonstrate that the construction of an HGM can robustly learn the patterns classifying patient representations of outcomes through leveraging patterns within the embeddings of similar patients. Our experimental results show that our relational learning-based HGM model achieves higher area under the receiver operating characteristic curve (auROC) than both comparator models in all prediction time windows, with dramatic improvements to recall.

8.
Philos Trans A Math Phys Eng Sci ; 378(2166): 20190394, 2020 Mar 06.
Article in English | MEDLINE | ID: mdl-31955674

ABSTRACT

Genomic datasets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share these data with the research community, but some of these genomic data analysis problems require large-scale computational platforms to meet both the memory and computational requirements. These applications differ from scientific simulations that dominate the workload on high-end parallel systems today and place different requirements on programming support, software libraries and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high-performance genomics analysis, including alignment, profiling, clustering and assembly for both single genomes and metagenomes. We identify some of the common computational patterns or 'motifs' that help inform parallelization strategies and compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'.

9.
Nucleic Acids Res ; 46(6): e33, 2018 Apr 06.
Article in English | MEDLINE | ID: mdl-29315405

ABSTRACT

Biological networks capture structural or functional properties of relevant entities such as molecules, proteins or genes. Characteristic examples are gene expression networks or protein-protein interaction networks, which hold information about functional affinities or structural similarities. Such networks have been expanding in size due to increasing scale and abundance of biological data. While various clustering algorithms have been proposed to find highly connected regions, Markov Clustering (MCL) has been one of the most successful approaches to cluster sequence similarity or expression networks. Despite its popularity, MCL's scalability to cluster large datasets still remains a bottleneck due to high running times and memory demands. Here, we present High-performance MCL (HipMCL), a parallel implementation of the original MCL algorithm that can run on distributed-memory computers. We show that HipMCL can efficiently utilize 2000 compute nodes and cluster a network of ∼70 million nodes with ∼68 billion edges in ∼2.4 h. By exploiting distributed-memory environments, HipMCL clusters large-scale networks several orders of magnitude faster than MCL and enables clustering of even bigger networks. HipMCL is based on MPI and OpenMP and is freely available under a modified BSD license.
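
For readers unfamiliar with MCL, the iteration that HipMCL parallelizes alternates an expansion step (squaring a column-stochastic matrix) with an inflation step (raising entries to a power and renormalizing columns). The following is a minimal single-machine sketch using dense Python lists. It is not HipMCL's implementation, which is distributed and sparse; the function names and the `inflation`, `max_iterations`, and `tolerance` defaults are assumptions chosen for illustration.

```python
def normalize_columns(matrix):
    """Scale each column so that it sums to 1 (column-stochastic)."""
    n = len(matrix)
    result = [row[:] for row in matrix]
    for j in range(n):
        column_sum = sum(matrix[i][j] for i in range(n))
        if column_sum > 0:
            for i in range(n):
                result[i][j] = matrix[i][j] / column_sum
    return result


def multiply(a, b):
    """Dense square matrix product; fine for a toy graph only."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def markov_cluster(adjacency, inflation=2.0, max_iterations=50, tolerance=1e-9):
    """Cluster an undirected graph given as a dense adjacency matrix."""
    n = len(adjacency)
    # Self-loops are conventionally added before normalizing.
    m = [
        [float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    m = normalize_columns(m)
    for _ in range(max_iterations):
        expanded = multiply(m, m)  # expansion: flow spreads along paths
        inflated = normalize_columns(
            [[value ** inflation for value in row] for row in expanded]
        )  # inflation: strong flows are boosted, weak flows decay
        change = max(
            abs(inflated[i][j] - m[i][j]) for i in range(n) for j in range(n)
        )
        m = inflated
        if change < tolerance:
            break
    # Each row with surviving mass names the nodes attracted to that row.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(cluster) for cluster in clusters)
```

The dense product costs cubic time per iteration, which is the kind of running-time and memory bottleneck the abstract refers to when it describes the original algorithm's scalability limits.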


Subjects
Algorithms , Cluster Analysis , Computational Biology/methods , Gene Regulatory Networks , Markov Chains , Gene Expression , Protein Interaction Maps/genetics
10.
Front Oncol ; 6: 188, 2016.
Article in English | MEDLINE | ID: mdl-27630823

ABSTRACT

We describe algorithms for discovering immunophenotypes from large collections of flow cytometry samples and using them to organize the samples into a hierarchy based on phenotypic similarity. The hierarchical organization is helpful for effective and robust cytometry data mining, including the creation of collections of cell populations characteristic of different classes of samples, robust classification, and anomaly detection. We summarize a set of samples belonging to a biological class or category with a statistically derived template for the class. Whereas individual samples are represented in terms of their cell populations (clusters), a template consists of generic meta-populations (a group of homogeneous cell populations obtained from the samples in a class) that describe key phenotypes shared among all those samples. We organize an FC data collection in a hierarchical data structure that supports the identification of immunophenotypes relevant to clinical diagnosis. A robust template-based classification scheme is also developed, but our primary focus is on the discovery of phenotypic signatures and inter-sample relationships in an FC data collection. This collective analysis approach is more efficient and robust since templates describe phenotypic signatures common to cell populations in several samples while ignoring noise and small sample-specific variations. We have applied the template-based scheme to analyze several datasets, including one representing a healthy immune system and one of acute myeloid leukemia (AML) samples. The last task is challenging due to the phenotypic heterogeneity of the several subtypes of AML. However, we identified thirteen immunophenotypes corresponding to subtypes of AML and were able to distinguish acute promyelocytic leukemia (APL) samples with the markers provided. Clinically, this is helpful since APL has a different treatment regimen from other subtypes of AML.
Core algorithms used in our data analysis are available in the flowMatch package at www.bioconductor.org. It has been downloaded nearly 6,000 times since 2014.

11.
BMC Bioinformatics ; 17: 291, 2016 Jul 28.
Article in English | MEDLINE | ID: mdl-27465477

ABSTRACT

BACKGROUND: Comparing phenotypes of heterogeneous cell populations from multiple biological conditions is at the heart of scientific discovery based on flow cytometry (FC). When the biological signal is measured by the average expression of a biomarker, standard statistical methods require that variance be approximately stabilized in populations to be compared. Since the mean and variance of a cell population are often correlated in fluorescence-based FC measurements, a preprocessing step is needed to stabilize the within-population variances. RESULTS: We present a variance-stabilization algorithm, called flowVS, that removes the mean-variance correlations from cell populations identified in each fluorescence channel. flowVS transforms each channel from all samples of a data set by the inverse hyperbolic sine (asinh) transformation. For each channel, the parameters of the transformation are optimally selected by Bartlett's likelihood-ratio test so that the populations attain homogeneous variances. The optimum parameters are then used to transform the corresponding channels in every sample. flowVS is therefore an explicit variance-stabilization method that stabilizes within-population variances in each channel by evaluating the homoskedasticity of clusters with a likelihood-ratio test. With two publicly available datasets, we show that flowVS removes the mean-variance dependence from raw FC data and makes the within-population variance relatively homogeneous. We demonstrate that alternative transformation techniques such as flowTrans, flowScape, logicle, and FCSTrans might not stabilize variance. Besides flow cytometry, flowVS can also be applied to stabilize variance in microarray data. With a publicly available data set we demonstrate that flowVS performs as well as the VSN software, a state-of-the-art approach developed for microarrays. 
CONCLUSIONS: The homogeneity of variance in cell populations across FC samples is desirable when extracting features uniformly and comparing cell populations with different levels of marker expressions. The newly developed flowVS algorithm solves the variance-stabilization problem in FC and microarrays by optimally transforming data with the help of Bartlett's likelihood-ratio test. On two publicly available FC datasets, flowVS stabilizes within-population variances more evenly than the available transformation and normalization techniques. flowVS-based variance stabilization can help in performing comparison and alignment of phenotypically identical cell populations across different samples. flowVS and the datasets used in this paper are publicly available in Bioconductor.
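
The parameter selection described under RESULTS can be sketched as follows: transform each population with asinh, score the transformed populations with Bartlett's statistic, and keep the parameter that gives the smallest statistic. This is an illustration under stated assumptions, not the flowVS package or its API (flowVS is distributed through Bioconductor). It assumes the single-cofactor form asinh(x / c), populations that have already been identified, non-zero variance in every population, and a simple search over a hypothetical list of candidate cofactors.

```python
import math
import statistics


def bartlett_statistic(groups):
    """Bartlett's test statistic for equal variances across groups."""
    k = len(groups)
    sizes = [len(group) for group in groups]
    total = sum(sizes)
    variances = [statistics.variance(group) for group in groups]
    pooled = sum((n - 1) * v for n, v in zip(sizes, variances)) / (total - k)
    numerator = (total - k) * math.log(pooled) - sum(
        (n - 1) * math.log(v) for n, v in zip(sizes, variances)
    )
    correction = 1 + (
        sum(1 / (n - 1) for n in sizes) - 1 / (total - k)
    ) / (3 * (k - 1))
    return numerator / correction


def asinh_transform(values, cofactor):
    """Apply asinh(x / cofactor) to every measurement."""
    return [math.asinh(x / cofactor) for x in values]


def best_cofactor(populations, candidates):
    """Return the candidate cofactor with the lowest Bartlett statistic."""
    scores = {
        c: bartlett_statistic([asinh_transform(p, c) for p in populations])
        for c in candidates
    }
    return min(scores, key=scores.get), scores
```

A statistic near zero indicates that the transformed populations have nearly equal variances, which is the homoskedasticity the abstract aims for within each channel.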


Subjects
Algorithms , Flow Cytometry , Analysis of Variance , CD Antigens/metabolism , Humans , Lymphocytes/cytology , Lymphocytes/metabolism
12.
BMC Bioinformatics ; 13 Suppl 2: S10, 2012 Mar 13.
Article in English | MEDLINE | ID: mdl-22536861

ABSTRACT

BACKGROUND: When flow cytometric data on mixtures of cell populations are collected from samples under different experimental conditions, computational methods are needed (a) to classify the samples into similar groups, and (b) to characterize the changes within the corresponding populations due to the different conditions. Manual inspection has been used in the past to study such changes, but high-dimensional experiments necessitate developing new computational approaches to this problem. A robust solution to this problem is to construct distinct templates to summarize all samples from a class, and then to compare these templates to study the changes across classes or conditions. RESULTS: We designed a hierarchical algorithm, flowMatch, to first match the corresponding clusters across samples for producing robust meta-clusters, and to then construct a high-dimensional template as a collection of meta-clusters for each class of samples. We applied the algorithm on flow cytometry data obtained from human blood cells before and after stimulation with anti-CD3 monoclonal antibody, which is reported to change phosphorylation responses of memory and naive T cells. The flowMatch algorithm is able to construct representative templates from the samples before and after stimulation, and to match corresponding meta-clusters across templates. The templates of the pre-stimulation and post-stimulation data corresponding to memory and naive T cell populations clearly show, at the level of the meta-clusters, the overall phosphorylation shift due to the stimulation. CONCLUSIONS: We concisely represent each class of samples by a template consisting of a collection of meta-clusters (representative abstract populations). Using flowMatch, the meta-clusters across samples can be matched to assess overall differences among the samples of various phenotypes or time-points.


Subjects
Algorithms , Flow Cytometry , T-Cell Antigen Receptors/metabolism , T-Lymphocytes/immunology , Humans , Phosphorylation